(57) Abstract

A method for object-based video description and linking is disclosed. The method constructs a companion stream for a video sequence 
which may be in any common format. In the companion stream, textual descriptions, voice annotation, image features, URL links, and 
Java applets may be recorded for certain objects in the video within each frame. The system includes a capture mechanism for generating
an image, such as a video camera or computer. An encoder embeds a descriptive stream with the video and audio signals, which combined
signal is transmitted by a transmitter. A receiver receives and displays the video image and the audio. The user is allowed to select whether
or not the embedded descriptive stream is displayed or otherwise used.

DESCRIPTION

A METHOD AND SYSTEM FOR OBJECT-BASED VIDEO DESCRIPTION AND 

LINKING 

Field of the Invention

This invention relates to an object-based description and linking method and
system for use in describing the contents of a video and linking such video contents with other
multimedia contents.

Background of the Invention

In this information age, we daily deal with vast amounts of video information when
watching TV, making home video, and browsing the World Wide Web. The video which we
receive or make is mostly in an "as is" state, i.e., there is no further information available about
the content of the video, and the content is not linked to other related resources. Because of this,
we view video in a passive manner. It is difficult for us to interact with the video contents and
utilize them efficiently. From time to time, we see someone or something in the video about
which we would like to find more information. Usually, we do not know where to find such
information, and do not begin or continue our quest. It is also difficult for us to search for video
clips which may contain certain content related to our interests.

Existing multimedia descriptive networking methods and languages comprise the
known art. Examples of such methods include the descriptive techniques used in connection with
digital libraries and computer languages, such as HTML and Java. The existing methods used in
digital libraries suffer from shortcomings in that they are not necessarily object-based, e.g., the
methods that use color histograms describe only the global color contents of a picture and do not
describe the contents of the picture; linking and networking capability is not inherent in the
systems; and, the video sources must be of a specific type in order to be compatible with the
primary language. Languages such as HTML and Java are difficult to use for describing and
linking video contents in a video sequence, especially when it is desired to treat the video
sequence at the object level.

If a video sequence were to be accompanied by a stream of descriptions and links
that provided additional information about the video, and which were embedded in the video
signal, we could find further information about certain objects in the video by looking up their
descriptions, or visiting their related Web sites or files, by following the embedded links. Such
descriptions and links may also provide useful information for content-based searching in digital
libraries.

Summary of the Invention
A new method and system for object-based video description and linking is
disclosed. The method constructs a companion stream for a video sequence which may be in any
common format. In the companion stream, textual descriptions, voice annotation, image features,
object links, URL links, and Java applets may be recorded for certain objects in the video within
each frame. The method may be utilized in many applications as described below.

The system of the invention includes a mechanism for generating an encoded
image. An encoder embeds a companion descriptive stream with a video signal. A video display
displays the video image. The user is allowed to select whether or not the embedded descriptive
stream is displayed or otherwise used.

It is an object of the invention to develop a method and system for describing and
linking video contents in any format at the video object level.

It is a further object of the invention to allow a video object to be linked to other 
video/audio contents, such as a Web site, a computer file, or other video objects. 

These and other objects and advantages of the invention will become more fully
apparent as the description which follows is read in connection with the drawings.

Brief Description of the Drawings 

Fig. 1 is a block diagram of the method of the invention. 

Fig. 2 is an illustration of the various types of links that may be incorporated into
the invention of Fig. 1.

Fig. 3 is a block diagram of the system of the invention as used within a television
broadcast scheme. 

Detailed Description of the Preferred Embodiment
A new method for describing and linking objects in an image or video sequence is
described. The method is intended for use with a video system having a certain digital
component, such as a television or a computer. It should be appreciated that the method of the
invention is able to provide additional description and links to any format of image or video.
While the method and system of the invention is generally intended for use with a video sequence,
such as in a television broadcast, video tape or video disc, or a series of video frames viewed on a
computer, the method and system are also applicable to single images, such as might be found in
an image database, and which are encoded in well-known formats, such as JPEG, MPEG, binary,
etc., or any other format. As used herein, "video" includes the concept of a single "image."

Referring now to Fig. 1, the method, depicted generally at 10, builds a description
stream 12 as a companion for a video sequence 14, having plural frames 16 therein. In each
selected frame, there may be one or more objects of interest, such as object 16a and object 16b.
It will be appreciated by those of skill in the art that not all of the frames in video sequence 14
must be selected for having a companion descriptive stream linked thereto.

The descriptive stream records further information about certain objects appearing
in the video. The stream consists of continuous blocks 18 of data where each block corresponds
to a frame 16 in the video sequence and a frame index 20 is recorded at the beginning of the
block. The "object of interest" may comprise the entire video frame. Additionally, a descriptive
stream may be linked to a number of frames, which frames may be sequential or non-sequential.
In the case where a descriptive stream is linked with a sequential number of frames, the
descriptive stream may be thought of as having a "lifespan," i.e., if the user does not take some
action to reveal the descriptive stream when a linked frame is displayed, the descriptive stream
"dies," and may not, in the case of a television broadcast, be revived. Of course, if the descriptive
stream is part of a video tape, video disc, or computer file, the user can always return to the
location of the descriptive stream and display the information. Some form of visible or audible
indicia may be displayed to indicate that a descriptive stream is linked with a sequence of video
frames. Descriptive stream 12 may also be linked to a single image.

The frame indexes are used to synchronize the descriptive streams with the video
sequences. The block may be further divided into a number of sub-blocks 22, 24, containing what
are referred to herein as descriptor/links, where each sub-block corresponds to a certain individual
object of interest appearing in the frame, i.e., sub-block 22 corresponds to one object 16a in the
frame and sub-block 24 corresponds to another object 16b in the same frame. There may be other
objects in the image that are not defined as objects of interest, and which, therefore, do not have a
descriptive stream and sub-block associated therewith. A sub-block includes a number of data
fields including but not limited to object index, textual description, voice annotation, image
features, object links, URL links, and Java applets. Additional information may include notices
regarding copyright and other intellectual property rights. Some notices may be encoded and
rendered invisible to standard display equipment.
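
As a rough illustration only, the block and sub-block layout described above could be modeled as follows. The class and field names (`SubBlock`, `Block`, `DescriptiveStream`, `block_for`), the rectangle used as the geometric definition, and the example URL are assumptions made for this sketch; the disclosure does not prescribe a concrete data format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class SubBlock:
    """Descriptor/links for one object of interest (hypothetical layout)."""
    # Assumed geometric definition: rectangle as (x, y, width, height).
    object_index: Tuple[int, int, int, int]
    textual_description: str = ""
    voice_annotation: bytes = b""
    image_features: Dict[str, str] = field(default_factory=dict)
    # Assumed link form: (frame index, sub-block position) of another object.
    object_links: List[Tuple[int, int]] = field(default_factory=list)
    url_links: List[str] = field(default_factory=list)
    java_applets: List[bytes] = field(default_factory=list)


@dataclass
class Block:
    """One block of the stream; the frame index leads the block."""
    frame_index: int
    sub_blocks: List[SubBlock] = field(default_factory=list)


@dataclass
class DescriptiveStream:
    blocks: List[Block] = field(default_factory=list)

    def block_for(self, frame_index: int) -> Optional[Block]:
        """Return the block synchronized with a frame, or None if the
        frame was not selected for description."""
        for block in self.blocks:
            if block.frame_index == frame_index:
                return block
        return None


stream = DescriptiveStream(blocks=[
    Block(frame_index=120, sub_blocks=[
        SubBlock(object_index=(40, 30, 80, 160),
                 textual_description="person 26",
                 url_links=["http://example.com/homepage"]),
    ]),
])
```

Frames without a block simply return `None` from `block_for`, which mirrors the statement that not every frame needs a companion descriptive stream.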

The object index field is used to index an individual object within the frame. It
contains the geometrical definition of the object. When a user pauses, or captures, the video at
some frame, the system processes all the object index fields within that frame, locates the
corresponding objects, and marks them in some manner, such as by highlighting them. The
highlighted objects are those that have further information recorded. If a user "clicks" on a
highlighted object, the system locates the corresponding sub-block and pops up a menu containing
the available information items.
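
The selection step can be sketched as a simple hit test. This is a minimal sketch that assumes the geometrical definition is an axis-aligned rectangle; the names `ObjectIndex`, `SubBlock`, and `select_object` are hypothetical and not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ObjectIndex:
    """Assumed geometric definition: an axis-aligned rectangle."""
    x: int
    y: int
    width: int
    height: int

    def contains(self, px: int, py: int) -> bool:
        # Left/top edges are inside, right/bottom edges are outside.
        return (self.x <= px < self.x + self.width
                and self.y <= py < self.y + self.height)


@dataclass
class SubBlock:
    object_index: ObjectIndex
    items: List[str]  # information items offered in the pop-up menu


def select_object(sub_blocks: List[SubBlock], px: int, py: int) -> Optional[SubBlock]:
    """Return the first sub-block whose object contains the clicked point."""
    for sub_block in sub_blocks:
        if sub_block.object_index.contains(px, py):
            return sub_block
    return None


frame_sub_blocks = [
    SubBlock(ObjectIndex(40, 30, 80, 160), ["textual description", "URL link"]),
    SubBlock(ObjectIndex(300, 200, 50, 50), ["voice annotation"]),
]
clicked = select_object(frame_sub_blocks, 50, 40)
```

A click outside every marked object returns `None`, in which case no menu would be shown.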

A textual description field is used to store further information about the object in
plain text. This field is similar to the traditional closed caption, and its contents may be any
information related to the object. The textual description can help keyword-based search for
relevant video contents. A content-based video search engine may look up the textual
descriptions of video sequences trying to match certain keywords. Because the textual
description fields are related to individual objects, they enable truly object-based search for video
contents.
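
A keyword lookup of this kind might look like the sketch below. The mapping from frame index to per-object descriptions, the whole-word matching rule, and the function name `search` are assumptions chosen for illustration; a real search engine would likely use a proper index.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: frame index -> textual descriptions, one per object.
DescriptionIndex = Dict[int, List[str]]


def search(descriptions: DescriptionIndex, keywords: List[str]) -> List[Tuple[int, int]]:
    """Return (frame index, object position) pairs whose textual
    description contains every keyword as a whole word, ignoring case."""
    wanted = [keyword.lower() for keyword in keywords]
    hits = []
    for frame_index in sorted(descriptions):
        for position, text in enumerate(descriptions[frame_index]):
            words = text.lower().split()
            if all(keyword in words for keyword in wanted):
                hits.append((frame_index, position))
    return hits


descriptions = {
    12: ["red sports car", "news anchor at desk"],
    30: ["red umbrella"],
}
red_hits = search(descriptions, ["red"])
car_hits = search(descriptions, ["red", "car"])
```

Because each hit names an object position within a frame rather than only a frame, the result is object-level, as the paragraph above describes.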

A voice annotation field is used to store further information about the object using
natural speech. Again, its contents may be any information related to the object.



An image features field is used to store further information about the object in
terms of texture, shape, dominant color, motion model describing motion with respect to a certain
reference frame, etc. Image features may be particularly useful for content-based video/image
indexing and retrieval in digital libraries.

An object links field is used to store links to other video objects in the same or
other video sequence or image. Object links may be useful for video summarization and
object/event tracking.

The URL links field, which is illustrated in Fig. 2, is used to store links to Web
pages and/or other objects which are related to the object. For a person in the scene, such as
person 26, i.e., the object of interest, the link in the sub-block 28 may be pointed to a URL 30 for
the person's personal homepage 32. A symbol or icon in the scene may be linked to a Web site
which contains the related background information. Companies may also want to link products
34 shown in the video, through a sub-block 36 to a URL 38 to their Web site 40 so that potential
customers may learn more about their products.

A Java applet field is used to store Java code to perform more advanced functions
related to the object. For example, a Java applet may be embedded to enable online ordering for a
product shown in the video. Java code may also be written to implement some sophisticated
similarity measures to empower advanced content-based video search in digital libraries.

In the case of digital video, the cassettes used for recording in such systems may
have a solid-state memory embedded therein which serves as an additional storage location for
information. The memory is referred to as memory-in-cassette (MIC). Where the video sequence
is stored on a digital video cassette, the descriptive stream may be stored in the MIC, or on the
video tape. In general, the descriptive stream may be stored along with the video or image
contents on the same media, i.e., a DVD disc or tape.

Figure 3 depicts the system of the invention, generally at 50, as is used in a
television broadcast scheme. System 50 includes a capture mechanism 52, which may be a video
camera, a computer capable of generating a video signal, or any other mechanism that is able to
generate a video signal. A video signal is passed to an encoder 54, which also receives
appropriate companion signals from the various types of links which will form the descriptive
stream, which encoder generates a combined video/descriptive stream signal 58. Signal 58 is
transmitted by transmitter 60, which may be a broadcast transmitter, a hard-wire system, or a
combination thereof. The combined signal is received by receiver 62, which decodes the signal
and generates an image for display on video display 64.
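
The encode and decode steps for the descriptive stream itself can be sketched as below. The wire format shown (a big-endian frame index and payload length followed by a JSON payload) is purely an assumption for illustration; the disclosure only requires that a frame index be recorded at the beginning of each block, and it does not specify how the stream is multiplexed with the video signal.

```python
import json
import struct
from typing import Iterator, List, Tuple

# Assumed header: 4-byte frame index, then 4-byte payload length (big-endian).
HEADER = struct.Struct(">II")


def encode_block(frame_index: int, sub_blocks: List[dict]) -> bytes:
    """Serialize one block with its frame index at the beginning."""
    payload = json.dumps(sub_blocks).encode("utf-8")
    return HEADER.pack(frame_index, len(payload)) + payload


def decode_stream(data: bytes) -> Iterator[Tuple[int, List[dict]]]:
    """Yield (frame index, sub-blocks) for each block in the byte stream."""
    offset = 0
    while offset < len(data):
        frame_index, length = HEADER.unpack_from(data, offset)
        offset += HEADER.size
        payload = data[offset:offset + length]
        offset += length
        yield frame_index, json.loads(payload.decode("utf-8"))


encoded = encode_block(3, [{"text": "person 26"}]) + encode_block(7, [])
decoded = list(decode_stream(encoded))
```

A receiver could use the leading frame index of each decoded block to line the descriptions up with the frame currently on display.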

A trigger mechanism 66 is provided to cause receiver 62 to decode and display the
descriptive stream. A decoder, in this embodiment, is located in receiver 62 for decoding the
embedded descriptive stream. The descriptive stream may be displayed in a picture-in-picture
(PIP) format on video display 64, or may be displayed on a descriptive stream display 68, which
may be co-located with the trigger mechanism, which may take the form of a remote control
mechanism for the receiver. Some form of indicia may be provided, either as a visible display on
video display 64, or as an audible tone, to indicate that a descriptive stream is present in the video
sequence.

Activating trigger mechanism 66 when a descriptive stream is present will likely
result in those objects which have descriptive streams associated therewith being highlighted, or
otherwise marked, to tell the user that additional information about the video object is present.
The data block information is displayed in the descriptive stream display, and the device
manipulated to allow the user to select and activate the display of additional information. The
information may be displayed immediately, or may be stored for future reference. Of key
importance is to allow the video display to continue uninterrupted so that others watching the
display will not be compelled to remove the remote control from the possession of the user who is
seeking additional information.

In the event that the system of the invention is used with a digital library, on a
computer system for instance, capture mechanism 52, transmitter 60 and receiver 62 may not be
required, as the video or image will have already been captured and stored in a library, which
library likely resides on magnetic or optical media which is hard-wired to the video or image
display. In this embodiment, a decoder to decode the descriptive stream may be located in the
computer or in the display. The trigger mechanism may be combined with a mouse or other
pointing device, or may be incorporated into a keyboard, either with dedicated keys, or by the
assignment of a key sequence. The descriptive stream display will likely take the form of a
window on the video display or monitor.

Applications

Broadcasting TV Programs 

TV stations may utilize the method and system of the invention to add more 
functionality to their broadcasting programs. They may choose to send out descriptive streams 
along with their regular TV signals so that viewers may receive the programs and utilize the 
advanced functions described herein. The scenario for a broadcast TV station is similar to that of 
sending out closed caption text along with regular TV signals. Broadcasters have the flexibility of
choosing to send or not to send the descriptive streams for certain programs. If a receiving
TV set has the capability of decoding the descriptive streams, the viewer may choose to use or
not use the advanced functions, just as the viewer may choose to view or not to view closed
caption text. If the user chooses to use the functions, the user may read extra text about someone
or something in the programs, hear extra voice annotations, or go directly to the related Web
site(s), if the TV set is Web enabled, or perform some tasks, such as online ordering, by running
the embedded Java applets.

For a video sequence, the descriptive stream may be obtained through a variety of 
mechanisms. It may be constructed manually using an interactive method. An operator may 
explicitly choose to index certain objects in the video and record some corresponding further 
information. The descriptive stream may also be constructed automatically using any video
analysis tools, especially those to be developed for the Moving Pictures Experts Group Standard
No. 7 (MPEG-7).

Consumer Home Video

The method and system of the invention may be utilized in making consumer
video. Camcorders, VCRs and DVD recorders may be developed to allow the construction and
storage of descriptive streams while recording and editing. Those devices may provide user
interface programs to allow a user to manually locate certain objects in their video, index them,
and record any corresponding information into the descriptive streams. For example, a user
may locate an object within a frame by specifying a rectangular region which contains the object.
The user may then choose to enter some text into the textual description field, record some
speech into the voice annotation field, and key in some Web page address into the URL links
field. The user may choose to allow the programming of the device to propagate those
descriptions to the surrounding frames. This may be done by tracking the objects in the nearby
frames. The recorded descriptions for certain objects may also be used as their visual tags.
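
Propagation of a description to nearby frames can be sketched as follows. The tracker here is a stand-in with made-up motion (5 pixels per frame, lost after two frames), and the names `propagate` and `shift_tracker` are hypothetical; the disclosure does not specify a tracking algorithm.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Rect = Tuple[int, int, int, int]  # assumed region form: x, y, width, height
Tracker = Callable[[int, Rect, int], Optional[Rect]]


def propagate(source_frame: int, region: Rect, description: str,
              frames: Iterable[int], track: Tracker) -> Dict[int, Tuple[Rect, str]]:
    """Copy one object's description to every frame where the tracker
    still finds the object; frames where it is lost are skipped."""
    result = {source_frame: (region, description)}
    for frame_index in frames:
        if frame_index == source_frame:
            continue
        tracked = track(source_frame, region, frame_index)
        if tracked is not None:
            result[frame_index] = (tracked, description)
    return result


def shift_tracker(source_frame: int, region: Rect, target_frame: int) -> Optional[Rect]:
    """Stand-in tracker: the object drifts 5 pixels right per frame and
    is considered lost more than 2 frames away from the source frame."""
    offset = target_frame - source_frame
    if abs(offset) > 2:
        return None
    x, y, width, height = region
    return (x + 5 * offset, y, width, height)


propagated = propagate(10, (100, 50, 40, 40), "family dog", range(8, 14), shift_tracker)
```

Passing the tracker in as a function keeps the propagation logic independent of whichever tracking method a device actually implements.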

If a descriptive stream is recorded along with a video sequence as described above,
that video can then be viewed later and support all the functions as described above.

Digital Video/Image Databases

As previously noted, the method and system of the invention may also be used in
digital libraries. The method may be applied to video sequences or images originally stored in any
common format including RGB, D1, MPEG, MPEG-2, MPEG-4, etc. If a video sequence is
stored in MPEG-4, the location information of the objects in the video may be extracted
automatically. This eases the burden of manually locating them. Further information may then be
added to each extracted object within a frame and propagated into other sequential or
non-sequential frames, if so selected. When a sequence or image is stored in a non-object-based
format, the mechanism described herein may be used to construct descriptive streams. This
enables a video sequence or image stored in one format to be viewed and manipulated in a
different format, and to have the description and linking features of the invention to be applied
thereto.

The descriptive streams facilitate content-based video/image indexing and retrieval.
A search engine may find relevant video contents at the object level by matching relevant 
keywords against the text stored in the textual description fields in the descriptive streams. The 
search engine may also choose to analyze the voice annotations, match the image features, and/or 
look up the linked Web pages for additional information. The embedded Java applets may
implement more sophisticated similarity measures to further enhance content-based video/image
indexing and retrieval.

Thus, a method and system for object-based video description and linking has been 
disclosed. It will be appreciated that variations and modifications thereof may be made within the 
scope of the invention as defined in the appended claims.

CLAIMS



1. A method of object-based description and linking of objects within an image,
comprising:

generating a descriptive stream, including a data block, for the image;
identifying at least one object of interest in the image;
inserting description/links into the data block for an object of interest; and
recording a frame index at the beginning of each data block for synchronizing the
description/links with the image.

2. The method of claim 1 wherein said inserting of description/links includes inserting
description/links taken from the group of description/links consisting of object indexes, textual
descriptions, voice annotation, image features, object links, URL links and Java applets.

3. The method of claim 1 wherein said identifying at least one object of interest
includes identifying the entire image as an object of interest.

4. The method of claim 1 wherein the image is a portion of a sequence of images
comprising a video sequence of video frames, and wherein said generating a descriptive stream
includes generating a descriptive stream for plural video frames in said video sequence.

5. The method of claim 4 wherein the video frames are in sequential order in said
video sequence.

6. The method of claim 4 wherein the video frames are in non-sequential order in said
video sequence. 

7. A method of object-based description and linking of objects within a video
sequence, wherein the video sequence includes plural video frames, comprising:

generating a descriptive stream, including a data block corresponding to a select
video frame in the video sequence;

identifying at least one object of interest in a video frame;

inserting description/links into the data block for an object of interest; and

recording a frame index at the beginning of each data block for synchronizing the
description/links with the video sequence.

8. The method of claim 7 wherein said inserting of description/links includes inserting
description/links taken from the group of description/links consisting of object indexes, textual 
descriptions, voice annotation, image features, object links, URL links and Java applets. 

9. The method of claim 7 wherein said identifying at least one object of interest 
includes identifying the entire video frame as an object of interest.

10. The method of claim 7 wherein said generating a descriptive stream includes
generating a descriptive stream for plural video frames in a video sequence.

11. The method of claim 10 wherein the video frames are in sequential order in a video
sequence. 

12. The method of claim 10 wherein the video frames are in non-sequential order in a 
video sequence. 

13. A system for object-based video description and linking of objects to an image,
wherein the image is represented by an electrical signal, comprising:

an encoder for embedding a descriptive stream with the electrical signal;

a display mechanism for displaying the image;

a decoder for decoding the embedded descriptive stream; and

a trigger mechanism for instructing said decoder to decode and display said
descriptive stream in a descriptive stream display at the request of a user, and for selecting, at the
request of a user, a particular portion of the descriptive stream with which to work.

14. The system of claim 13 which further includes a capture mechanism for generating
the image as a sequence of video frames, and for converting said image into a video signal.






wo 98/47084 



PCT/JP98/01736 



15. The system of claim 14 which further includes a transmitter for transmitting said
video signal and said embedded descriptive stream; and a receiver constructed and arranged for 
receiving said video signal and said embedded descriptive stream and for displaying a video 
image. 

16. The system of claim 14 wherein said capture mechanism is taken from the group 
consisting of video cameras and computers. 

17. The system of claim 13 wherein said trigger mechanism is located in a remote- 
control device. 

18. The system of claim 13 wherein said descriptive stream display is located in a 
remote-control device. 